
Learning Articles - Data Science


Basic Approach And Process Overview

To gain an initial understanding, an exploratory data analysis can be performed as a preliminary approach, which involves interrogating, visualizing, and summarizing the main characteristics of a set of data to gain insights and identify patterns or trends. For forecasting, a model can then be built using machine learning, which is a branch of artificial intelligence and computer science focusing on the use and development of data and algorithms to form a model that makes predictions. Creating a model involves using a training dataset (estimating the parameters of the machine learning method) and a testing dataset (evaluating the performance on unseen data) to create an algorithm and validate the accuracy of the algorithm. ...

Exploratory Data Analysis

...

Machine Learning Models

The purpose of machine learning is to create a model which is able to make predictions for the regression or classification of target variables given a set of input variables.

In order to avoid overfitting or underfitting the model, it is necessary to evaluate the performance of the model on new data which was not used to create the model. This is done by dividing the available data into a training dataset for creating the model (estimating the parameters of the method) and a testing dataset for validating the accuracy of the model (evaluating the performance on unseen data). There are various ways in which this division can be performed, and the choice depends on the amount of data available, but a general guideline is to sample around 60% to 80% of the data for training and hold out the remaining 20% to 40% of the data for testing. Another common approach is to ... .
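As a minimal sketch, the holdout split described above could be performed in Python with scikit-learn (the synthetic data and the 80/20 ratio below are illustrative assumptions rather than part of any particular analysis):

```python
# A minimal sketch of a holdout split using scikit-learn; the synthetic data
# and the 80/20 ratio are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 3))  # 100 samples, 3 covariates
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the samples for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```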

In addition, cross-validation can be used to make better use of the available data. This essentially acts as a resampling method, using different portions of the data to train and test a model on different iterations. In other words, instead of dividing the data into a training dataset and a testing dataset for a single iteration, multiple iterations are performed, where different combinations of portions of the data are used for the training dataset and the testing dataset. The results can then be summarized as the average measures of fitness in prediction, as every portion of the data is used in testing at some point. This avoids the chance of an unlucky split being chosen for the division between the training dataset and the testing dataset. It also allows different regression or classification methods to be compared to give a sense of their relative performance.

When performing cross-validation, the number of iterations, deriving from the number of combinations of portions of the data, depends on the size of the dataset, but at least 5 portions of the data are typically used (known as 5-fold cross-validation). In an extreme case, the individual samples within the dataset can be considered as the portions of the data, where a single sample is used as the testing dataset in each iteration and the rest of the samples are used for the training dataset (known as leave-one-out cross-validation). It is also possible to extend cross-validation to the estimation and optimization of hyperparameters, which are tuning parameters not estimated from the data by the method itself.

Diagram of k-fold cross-validation: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/1920px-K-fold_cross_validation_EN.svg.png
Example of leave-one-out cross-validation: https://en.wikipedia.org/wiki/File:LOOCV.gif
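As a rough illustration of the idea, assuming Python with scikit-learn and a synthetic linear dataset, 5-fold and leave-one-out cross-validation might look like the following sketch:

```python
# A minimal sketch of 5-fold and leave-one-out cross-validation using
# scikit-learn; the linear model and synthetic data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.2, size=50)

model = LinearRegression()

# 5-fold cross-validation: each fold is used once as the testing dataset.
scores_5fold = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold R^2 scores:", scores_5fold, "mean:", scores_5fold.mean())

# Leave-one-out cross-validation: each single sample is used once for testing.
# (R^2 is undefined for a single test sample, so score by squared error instead.)
scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV mean squared error:", -scores_loo.mean())
```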

It should be noted that ...deep learning and neural networks... .

Model Performance Evaluation

...

Bias is ... .

The error is defined as the difference between an actual value and the predicted value of the target variable. This allows a distribution of the error to be formed over the values in the dataset. It can then be helpful to calculate various metrics of this distribution, such as the mean, median, skewness, ... . In order to ensure that a model is not overfit or underfit, the distribution of error and the various metrics in the training dataset should be similar to those in the testing dataset. If the error is lower in the training dataset compared to the testing dataset, the model is overfit to the characteristics of the training dataset and is unable to capture the differences in the testing dataset. If the error is higher in the training dataset compared to the testing dataset, the model is underfit ... . The common definitions of error include the mean squared error, ... .

Mean squared error: https://en.wikipedia.org/wiki/Mean_squared_error
\[\begin{gather*} \text{MSE} = \frac{1}{n} \sum_{i = 1}^{n} (y_i - \hat{y}_i)^{2} \end{gather*}\]
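As a small illustrative sketch (the actual and predicted values below are made up for demonstration), the error distribution and the mean squared error could be computed as:

```python
# A minimal sketch of computing the error distribution and mean squared error
# for a fitted model; the arrays below are illustrative assumptions.
import numpy as np

y_true = np.array([3.0, 2.5, 4.1, 5.0, 3.3])
y_pred = np.array([2.8, 2.7, 4.0, 5.4, 3.1])

errors = y_true - y_pred    # per-sample errors (residuals)
mse = np.mean(errors ** 2)  # mean squared error
print("mean error:", errors.mean())
print("median error:", np.median(errors))
print("MSE:", mse)
```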

This is related to the bias-variance tradeoff, where decreasing the bias of a model typically comes at the cost of increasing its variance, and vice versa. ...

Applied to a model, the coefficient of correlation can be used to measure the strength of the association between the predicted values and real values. ...

...

The goodness of fit of a regression model can be evaluated through the coefficient of determination, a metric related to correlation that gives the proportion or fraction of the variation in the response which is predictable from the covariates using the model. In other words, relative to the model, the coefficient of determination shows the degree to which the predicted values approximate the real values. A coefficient of determination of 1 implies that the predicted values perfectly fit the real values, while a coefficient of determination of 0 implies that the predicted values perform no better than the baseline least-squares predictor; values outside the range 0 to 1 can occur when the model performs worse than this baseline (for example, when nonsensical constraints are applied). It should be noted that the baseline least-squares predictor is equivalent to a horizontal hyperplane at a height equal to the mean of the real values.

The adjusted coefficient of determination modifies the coefficient of determination to account for the number of covariates in the model. This includes a penalty in the calculation which grows as extra variables are included in the model. This can be necessary, as a drawback of the coefficient of determination is that it is likely to always increase as additional variables are included, even if these variables are arbitrary and the increase is due only to chance. It should also be noted that the coefficient of determination and adjusted coefficient of determination offer easier interpretation and intuition than the coefficient of correlation when comparing values.

Definition of the coefficient of determination:
\[\begin{gather*} R^2 = 1 - \frac{\sum_{i = 1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i = 1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\text{Var} (y - \hat{y})}{\text{Var} (y)} \end{gather*}\]
Definition of the adjusted coefficient of determination:
\[\begin{gather*} R_{adj}^2 = 1 - \frac{(1 - R^2)(N_{smp} - 1)}{N_{smp} - N_{cvr} - 1} \end{gather*}\]
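As a minimal sketch of these two formulas (the actual and predicted values and the assumed count of 2 covariates are illustrative), the coefficient of determination and its adjusted form can be computed directly from a set of predictions:

```python
# A minimal sketch of computing the coefficient of determination and its
# adjusted form; the data and the 2-covariate model are illustrative assumptions.
import numpy as np

y_true = np.array([3.0, 2.5, 4.1, 5.0, 3.3, 4.4])
y_pred = np.array([2.8, 2.7, 4.0, 5.4, 3.1, 4.6])
n_smp, n_cvr = len(y_true), 2  # number of samples and covariates

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n_smp - 1) / (n_smp - n_cvr - 1)
print("R^2:", r2, "adjusted R^2:", r2_adj)
```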

Basic Regression Methods

For simple data with a single covariate and response, a basic prediction model can be constructed by fitting a curve to the data. The function of the curve ... . The optimal parameters of this function for the best fit can be determined by assessing the resulting residual error between the given function and each response value. It can then be concluded that the optimal parameters of this function for the best fit are found when the accumulated error satisfies specified criteria. A common method is to use the squared error (to remove offsets from positive and negative residuals) by considering the squared difference between the given function and each response value, accumulating these squared differences as the sum of the squared residuals, and comparing the resulting value to indicate the fitness (minimized, where the derivatives with respect to the parameters are zero, for the best fit).

Demonstration of fitting a linear function using the sum of the squared residuals (method of ordinary least-squares):
\[\begin{gather*} \hat{y}_i = \beta x_i + \alpha \rightarrow S = \sum_{i = 1}^{n} (\hat{y}_i - y_i)^{2} = \sum_{i = 1}^{n} ((\beta x_i + \alpha) - y_i)^{2} \rightarrow \text{Solve: } \frac{\partial S}{\partial \beta} = 0 \text{ and } \frac{\partial S}{\partial \alpha} = 0 \\ \therefore \text{With } \bar{y} = \alpha + \beta \bar{x},\, \beta = \frac{\sum_{i = 1}^{n} (y_i - \bar{y}) (x_i - \bar{x})}{\sum_{i = 1}^{n} (x_i - \bar{x})^2} = \frac{\text{Cov} (x, y)}{\text{Var} (x)} \rightarrow \alpha = \bar{y} - \beta \bar{x} \end{gather*}\]
Figure: Plot with many different linear curves. Plot of the sum of the squared residuals showing the minimum (3-dimensional, with slope by intercept by sum of squared residuals).
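As a minimal sketch of the closed-form solution above (the data points are illustrative), the slope and intercept can be computed directly from the covariance and variance:

```python
# A minimal sketch of fitting a line with ordinary least-squares using the
# closed-form covariance/variance solution; the data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = Cov(x, y) / Var(x)
alpha = y.mean() - beta * x.mean()                     # intercept from the means
print("slope:", beta, "intercept:", alpha)
```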

This approach is a basic example of linear regression using the method of ordinary least-squares, where the fit of the function is linear and found in the form of a line (or, more generally, a hyperplane) with a closed-form solution. The performance of this model is ... when there is minimal uncertainty in the variation of the covariates relative to the response. Higher-order and polynomial functions can also be fitted in the same manner (although the function is non-linear, the model is still linear as a statistical estimation in the response).

Definition of a regression model with a linear function:
\[\begin{gather*} y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_N x_N + \epsilon \end{gather*}\]
Definition of a regression model with a quadratic function:
\[\begin{gather*} y_i = \beta_0 + \beta_{11} x_1 + \beta_{12} x_1^2 + \beta_{21} x_2 + \beta_{22} x_2^2 + \cdots + \beta_{N1} x_{N} + \beta_{N2} x_{N}^2 + \epsilon \end{gather*}\]
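As an illustrative sketch, assuming Python with scikit-learn, the quadratic form above can be fitted as a linear model by expanding the covariates into polynomial features (the synthetic data and coefficients are made up):

```python
# A minimal sketch of fitting a quadratic form as a linear model by expanding
# the covariates into polynomial features; the data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(seed=0)
X = rng.uniform(-2, 2, size=(100, 2))
y = 1.0 + 0.5 * X[:, 0] + 2.0 * X[:, 0] ** 2 - X[:, 1] + 0.3 * X[:, 1] ** 2

# Expand [x1, x2] into [x1, x2, x1^2, x1*x2, x2^2]; the model stays linear in the parameters.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)
```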

For more complexity, non-linear least-squares methods can be used, which are usually solved by iterative refinement (although, at each iteration, the system is approximated by a linear system, so the core calculation is still similar to ordinary least-squares).
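As a rough sketch, assuming SciPy is available, a non-linear function such as an exponential (chosen here purely for illustration) can be fitted by iterative refinement with scipy.optimize.curve_fit:

```python
# A minimal sketch of non-linear least-squares solved by iterative refinement
# via scipy.optimize.curve_fit; the exponential model and data are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # A function that is non-linear in the parameters a and b.
    return a * np.exp(b * x)

rng = np.random.default_rng(seed=0)
x = np.linspace(0, 2, 50)
y = 1.5 * np.exp(0.8 * x) + rng.normal(scale=0.1, size=x.size)

params, covariance = curve_fit(model, x, y, p0=[1.0, 1.0])  # iterative fit from an initial guess
print("fitted a, b:", params)
```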

Basic Classification Methods

A decision tree is ...
K-nearest neighbours
Logistic regression
Support vector machines
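As an illustrative sketch, assuming Python with scikit-learn and its built-in iris dataset (chosen purely for demonstration), each of these classification methods could be fitted and compared on a simple holdout split:

```python
# A minimal sketch comparing the classification methods listed above on a toy
# dataset with scikit-learn; the dataset and settings are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", clf.score(X_test, y_test))
```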